Machine Learning

Introduction

To better understand the impact of the pandemic on long-distance commuting patterns, our team used a Random Forest model to examine the underlying characteristics that influence OD distance (origin-destination distance). The analysis aims to shed light on the relationship between workplaces and neighborhoods. This modeling approach allows us to test our hypothesis: growing job opportunities in New York City have driven a surge in housing prices, intensifying the imbalance between where people live and where they work. This gravitational pull has created a pronounced ‘siphoning effect’ on the surrounding metro region. We predict that, because of the high-density job market, higher incomes, and soaring housing prices within NYC, together with the new work-from-home (WFH) pattern after the pandemic, more people will in the future choose to live in the surrounding areas while commuting to work in central New York City.

Feature Engineering & Modeling

To develop a Random Forest regression model, we first split the 2021 data into a training dataset (70%) and a testing dataset (30%). We integrated a range of factors that may contribute to long-distance commuting, so that the model can reveal which features matter most for predicting it. We also compared the significant features before and after COVID-19 to see whether they differ. Based on the dataset obtained in the first step, we ultimately selected the following features:

  • Population Data: Population is closely related to work, income, and wages, making it an important variable for predicting future work distribution.
  • Income: Personal income is also a crucial factor influencing people’s choices about long-distance commuting.
  • Population Density: Work is closely related to population density, and future population trends will affect future employment patterns.
  • Age: The age of residents indicates how much of the population belongs to the labor force.
  • Housing Prices: Many people choose to live and work in the areas surrounding New York for leisure, so housing prices could be an important factor.
  • Work Patterns: The number and types of job positions are important factors in attracting individuals.
  • Total Job Positions: In general, an area with many job positions tends to be attractive to people.
  • Job Types: Job types sometimes reflect an individual’s work pattern, which could be another influencing factor.
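The 70/30 split described before the feature list can be sketched as follows. This is a minimal illustration using only the Python standard library; the `tracts` records, their field names, and the seed are hypothetical, and the original notebook may well have used a library helper (for example scikit-learn's `train_test_split`) instead.

```python
import random


def train_test_split_rows(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: one dict per workplace tract (values are made up).
tracts = [{"tract_id": i, "total_jobs": 100 + 10 * i} for i in range(10)]

train, test = train_test_split_rows(tracts)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible, so feature importances from different runs can be compared on the same held-out tracts.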

Next, we created a correlation matrix to examine the correlations among these numerical variables.
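As a rough illustration of what such a matrix contains, the sketch below computes pairwise Pearson correlations in plain Python. The column names and all values are made up for the example and are not taken from the project data; in a notebook the same result is usually obtained from a data-frame library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical feature columns (illustrative values only).
features = {
    "total_jobs": [120, 450, 80, 900, 300],
    "median_housing_value": [400, 650, 350, 980, 520],
    "population_density": [30, 12, 45, 8, 20],
}

corr = correlation_matrix(features)
for name, row in corr.items():
    print(name, [round(value, 2) for value in row.values()])
```

Pairs with a correlation close to 1 or -1 carry largely overlapping information, which is worth keeping in mind when reading feature importances.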

We then identified the most influential factors. From the graph below, we find that the most critical factors are the number of job positions, median housing value, and population density. This aligns with our initial assumptions, as these features reflect, to some extent, the job market, worker income, and housing prices. In terms of geographical location, Staten Island and Manhattan have higher importance, which suggests that the model may perform better or be more applicable in these areas.

Model Performance

By building a Random Forest model, we aim to model OD distance at the workplace level and gain a clearer understanding of the relationship between job distribution and community dynamics.

To further improve the model, we used 5-fold cross-validation to find the best model: depth = 10, estimators = 30. The best model’s R^2 is 0.24. This is not very high because the LODES data are reported as intervals rather than precise figures. However, the feature importances still make sense for our research.
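To make the scoring step concrete, here is a minimal standard-library sketch of 5-fold cross-validation scored with R^2. It is not the project's code: `mean_baseline` is a stand-in model that only predicts the training mean (not a Random Forest), the toy `X` and `y` are made up, and in practice each candidate pair of depth and number of estimators would be scored this way, typically through a library routine such as scikit-learn's `GridSearchCV`.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_val_r2(fit_predict, X, y, k=5):
    """Average validation R^2 of fit_predict(X_train, y_train, X_val) over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in val_idx],
        )
        scores.append(r2_score([y[i] for i in val_idx], preds))
    return sum(scores) / len(scores)


def mean_baseline(X_train, y_train, X_val):
    """Stand-in model (assumption): always predicts the training mean."""
    mean_y = sum(y_train) / len(y_train)
    return [mean_y for _ in X_val]


# Toy data, made up for illustration.
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]

baseline_score = cross_val_r2(mean_baseline, X, y, k=5)
print(baseline_score)
```

A model that only predicts the mean scores at or below zero on held-out data, so an R^2 of 0.24 means the model explains about a quarter of the variance in OD distance beyond that baseline.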

After finding the best model, we calculated the predicted OD distance and the percentage error for each workplace tract.
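The calculation can be sketched as below. The exact error definition is an assumption here: the sketch uses a signed percentage error, (predicted - observed) / observed × 100, so that positive values mean the prediction is too high. The tract labels and distances are hypothetical.

```python
from collections import defaultdict
from statistics import median


def percentage_error(observed, predicted):
    """Signed percentage error; positive means the model over-predicts."""
    return (predicted - observed) / observed * 100.0


# Hypothetical rows: (workplace tract, observed OD distance, predicted OD distance).
records = [
    ("tract_A", 10.0, 12.0),
    ("tract_A", 20.0, 19.0),
    ("tract_B", 8.0, 6.0),
]

errors_by_tract = defaultdict(list)
for tract, observed, predicted in records:
    errors_by_tract[tract].append(percentage_error(observed, predicted))

# One summary value per tract, suitable for mapping.
median_error = {tract: median(errors) for tract, errors in errors_by_tract.items()}
print(median_error)
```

Keeping the sign, rather than taking an absolute value, is what allows regions to be described as predicted too high or too low.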

We find that the distribution of model errors has a strong regional pattern. Overall, the model predictions in Brooklyn and Queens are skewed high, while those in most other districts are skewed low. This is related to the residential character of Brooklyn and Queens: because of this positioning, these two districts do not have a large number of jobs.

Additionally, Manhattan has a low overall error, and its predicted values are generally lower than the true values. This is because Manhattan is one of the core employment areas in New York and has a stronger job attraction for residents.


Extra Exploration: Will the Subway Influence Our Project?

In addition, since OD distance is not exactly equal to commuting distance, local transportation facilities may also have a considerable impact on OD distance. Our team therefore wanted to explore whether there is a relationship between the percentage error of our model and the distribution of subway stations: a local subway station increases commuting capacity, which may attract more people who commute in from far away. To do so, we made a map overlaying subway station points on the median percentage error map.
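Beyond the visual overlay, one possible way to quantify the relationship is to compute each tract's distance to its nearest subway station and compare it with the tract's median percentage error. The sketch below is an assumption about how this could be done, not the project's method; the coordinates and error values are made-up illustrative points, and the great-circle (haversine) distance ignores the street network.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station_km(tract_latlon, stations):
    """Distance from a tract centroid to the closest station."""
    return min(haversine_km(*tract_latlon, *station) for station in stations)


# Hypothetical station points and tract centroids (lat, lon), values made up.
stations = [(40.75, -73.99), (40.70, -73.95)]
tract_centroids = {"tract_A": (40.75, -73.99), "tract_B": (40.60, -74.10)}
median_pct_error = {"tract_A": 7.5, "tract_B": -25.0}

# Pair each tract's nearest-station distance with its median percentage error.
pairs = {
    tract: (nearest_station_km(latlon, stations), median_pct_error[tract])
    for tract, latlon in tract_centroids.items()
}
for tract, (distance_km, error) in pairs.items():
    print(tract, round(distance_km, 2), error)
```

With real data, the resulting pairs could be plotted or correlated to check whether tracts far from any station show systematically different errors.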